Project Report
Team Too Pullz
1 Background / Motivation
Food insecurity is prevalent across the United States, and continues to be perpetuated by a combination of economic, geographic, and sociological factors, such as income and number of dependents. Instances of food insecurity can also be triggered by the time of year, natural disasters, and economic tumult, among other causes. Due to the complex, ever-changing nature of the problem, governments often grapple with the task of addressing food insecurity. According to Barrett, “if [the government] knew better the predictive accuracy of different indicators in forecasting future food security states, [they] could cost-effectively concentrate data collection measures of which targetable actions can be most reliably programmed” [1].
The principal food security support provided by the U.S. government is the Supplemental Nutrition Assistance Program (SNAP). Through SNAP, monetary funds are distributed monthly to qualifying individuals via Electronic Benefit Transfer cards. After an individual applies for SNAP benefits, the amount of aid provided to their household is determined by their gross and net income. Food assistance does not reach all those who need it: households often do not participate in SNAP due to cultural stigmas, transportation costs to a SNAP office, and inadequate benefit levels. A predictive model that forecasts food insecurity would therefore better target households truly in need of assistance. Our group was interested in analyzing the U.S. Census Bureau’s dataset of household survey results, motivated by improving the accuracy with which food insecure homes in America are identified, and by streamlining the surveying and financial allocation processes performed by the U.S. government and nonprofits. Upon examining the dataset, we noticed that it contains a second survey whose results are used to determine food insecurity directly. We wondered whether we could bypass the need for this second survey by finding the factors that best indicate food insecurity, and building models with high predictive accuracy and recall. Our resulting boosting models and ensemble for predicting whether a household is food insecure aim to address these cultural issues while giving our team a more comprehensive insight into a topic we are interested in. Our model(s) can thus serve as a basis for deciding how to allocate funds and aid effectively, and which households to allocate aid to. Additionally, our model(s) aim to make the second survey, whose sole purpose is to determine food insecurity, unnecessary, saving the U.S. government time and money that can be redirected toward actual aid. With this project, we hope to mitigate the difficulties nonprofits and the government face in identifying food insecurity and allocating aid.
2 Problem statement
Driving Question: How can we best predict the food security state of U.S. households, based on the annual survey data obtained by the Census Bureau? Furthermore, how can we streamline the data collection process to efficiently classify households as either secure or insecure?
We are working to identify and accurately predict a household’s food insecurity status based on its relationship to a collection of predictors/census responses from our data. We considered the variable summarizing food insecurity as a function of the second survey (the Food Security Supplement) as our response variable. Our problem is focused on prediction: our overarching objective is to best predict the food security state of U.S. households, based on the annual population survey data obtained by the Census Bureau. A secondary objective is to use predictive model(s) to streamline the data collection process and efficiently classify households as either secure or insecure. These objectives were set based on the research summarized in our background, and we aim to address them using machine learning models and techniques.
3 Data sources
We used an open-source dataset from the U.S. Census Bureau website. The data contains the responses of the annual Current Population Survey, along with the Food Security Supplement Survey, which is directly used to create the response variables at the end of the dataset. Both surveys were conducted across the United States, in order to glean the range of food security experienced in U.S. households. The survey results are accompanied by an extensive documentation file, which explains all of the variable names and response meanings, and delineates the Population Survey from the Food Security Supplement Survey. Generally, each predictor corresponds to a survey question or response variable.
Our chosen dataset has over 500 predictors, which range from household demographics to economic status. Our response variable is named HRFS12CX. The values for the response are 2 (household did not pass the common screening), 1 (household passed the common screening), or -5 (missing, no valid scale items). All predictors from the Food Security Supplement, all extraneous response variables, and unique ID columns were dropped from the dataset before variable selection, due to their direct correlation with the response variables, their identity as a response variable, or their inability to generalize as a predictor in a model.
The documentation file for the dataset also gives an overview of how the data was collected: “The CPS has been conducted monthly for over 60 years. Currently, [they] obtain interviews from about 54,000 households monthly, scientifically selected on the basis of area of residence to represent the nation as a whole, individual states, and other specified areas. Each household is interviewed once a month for four consecutive months one year, and again for the corresponding time period a year later. This technique enables [them] to obtain reliable month-to-month and year-to-year comparisons at a reasonable cost while minimizing the inconvenience to any one household” [2].
Link to our data: https://www.census.gov/data/datasets/time-series/demo/cps/cps-supp_cps-repwgt/cps-food-security.html#cpssupps
4 Stakeholders
Our efforts aim to assist U.S. government agencies, which are responsible for addressing food insecurity through national policies and programs. As previously mentioned, SNAP is the government’s primary instrument for addressing food insecurity. Insights into which households need financial assistance would help the government allocate SNAP funds appropriately and efficiently. Our model would also bypass the need for the Food Security Supplement, which is primarily gathered to determine the response (labeling a house as food insecure or food secure), saving the government time and money that can be focused on pointed allocation of aid.
Non-profit organizations can also utilize our findings to identify areas of need, and target resources and households based on whether the household was predicted to be food insecure or not. Some examples of pertinent nonprofits are listed below.
- Feeding America is a U.S. based organization that focuses on establishing food banks, pantries, and soup kitchens in food insecure communities. The organization is ranked as the largest U.S. charity by revenue, as of 2018. [3]
- The Food Research & Action Center (FRAC) is a private, nonprofit organization that works to alleviate poverty-related hunger in the U.S. FRAC works with government agencies to strengthen public policy surrounding food insecurity. [4]
Through our project, we also hope to help food insecure individuals receive the resources they need by accurately predicting their household’s food insecurity status. This includes children who rely on government funded school lunches, individuals with limited physical mobility, and households with insufficient means to obtain adequate food.
5 Data quality check / cleaning / preparation
As noted in the documentation file, the U.S. Census Bureau directly utilized the responses from the Food Security Supplement to create the outcome variable. Including the predictors from the supplemental survey would likely result in an unrealistically high model performance, characterized by inflated performance metrics. Thus, we opted to only use predictors from the CPS to conduct our analysis, achieving a predictive accuracy that renders the second survey obsolete, saving the government money and time.
Upon considering missing values, we first noted that the dataset contains approximately 127,000 randomly sampled household addresses. The data includes households that were vacant, could not be contacted, or refused to participate in the survey. These households exhibited missing values across all predictors and the response, so we removed them from our dataset. After removing non-interviews, 71,533 observations remained.
Our response variable was originally coded as 2 (household did not pass the common screening) or 1 (household passed the common screening). We recoded these values to 1 and 0, respectively, to ensure interpretability and consistency throughout our model development.
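A minimal sketch of this recoding step is shown below. The column name HRFS12CX matches the dataset documentation, but the sample rows are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the cleaned CPS data; only the response
# column is shown, with made-up rows.
df = pd.DataFrame({"HRFS12CX": [2, 1, 1, 2, 1]})

# Recode: 2 (did not pass the common screen, food insecure) -> 1,
#         1 (passed the common screen, food secure)         -> 0.
df["y"] = df["HRFS12CX"].map({2: 1, 1: 0})
print(df["y"].tolist())  # [1, 0, 0, 1, 0]
```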
6 Exploratory data analysis
Since our dataset had a large number of predictors, we first created a basic Random Forest model to determine feature importances. The model yielded 22 predictors with feature importances above our tuned threshold of 0.01. We then conducted EDA with these 22 predictors and produced the following insights:
The variables that have the strongest correlation with the response are Household Ownership, Household Type, and Family Income. Although this EDA did not directly lead to actionable items or variable transformations (which do not benefit boosting models), we still found it relevant for deepening our knowledge of the survey data and how it is conducted.
- Household Ownership pertains to how the household is being paid for. The ownership type containing the most survey participants is “Owned or being bought by a household member”.
- Family Income: The income bracket containing the most survey participants is “$150,000 OR MORE”. Approximately 7% of the observations fall under $15,000, which was the approximate national poverty threshold for a 1-2 person household in 2021 [5] (https://aspe.hhs.gov/2021-poverty-guidelines).
- Number of Household Members: 31 percent of respondents live in 2 person households.
- Household Type describes the relationship of the household members to each other. Households are categorized as families or individuals, and interviewees are marked as male or female, and married or unmarried. As noted in the distribution plot, a majority of households were categorized as “Husband/Wife Primary Family”.
Overall, we noted that many of the households included in our data have an income well over the poverty level, are of at least two people, and can be classified as families (as opposed to individuals).
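The feature-importance selection step described at the start of this section can be sketched as follows. The data here are synthetic stand-ins rather than the CPS predictors, but the thresholding at 0.01 mirrors the procedure above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (the real data has 500+
# predictors; these columns are illustrative, not CPS fields).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Fit a basic random forest, then keep only predictors whose
# impurity-based importance exceeds the tuned threshold of 0.01.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
selected = np.where(rf.feature_importances_ > 0.01)[0]
print(f"{len(selected)} of {X.shape[1]} predictors retained")
```

Note that impurity-based importances always sum to 1, so the 0.01 cutoff is relative to the total pool of predictors.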
7 Approach
After variable selection using Random Forest feature importances, we developed a total of four models: AdaBoost, XGBoost, GradientBoosting, and RandomForest. We then created voting and stacking ensembles based on these models. We aimed to optimize the accuracy and recall of our models. We focused on accuracy to ensure that our models correctly predict which households are food insecure more often than not. Furthermore, we optimized recall so that our models capture as many food insecure instances as possible, minimizing the risk of false negatives (labeling a household as food secure when it is not). Since our models’ objective is to benefit organizations that help households struggling with food insecurity, predicting food insecure households as secure would hinder the allocation of resources to households in need.
Solutions have been previously developed for the broader topic of food security, such as research on COVID-19’s Impact on Global Food Security; however, there has not been a public solution developed for this specific data set and year. We hope that by creating a predictive model for this post-pandemic, most recent data, our model will not only explain food insecurity for 2021, but also for the coming years, assuming the political and social climate continues to be largely consistent.
8 Developing the model: Hyperparameter tuning
9 AdaBoost
By Lila Weiner
At the onset of my model development, I created a base model to obtain an understanding of AdaBoost’s performance on our data. With no hyperparameter/decision threshold tuning, the AdaBoost model exhibited an 83.51% accuracy and 26.71% recall on the test data. From here, I conducted a coarse grid search of three hyperparameters: learning_rate, base_estimator__max_depth, and n_estimators. My initial grid search comprised the following hyperparameter values:
grid = dict()
grid['n_estimators'] = [100, 200, 500, 700, 1000]
grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1]
grid['base_estimator__max_depth'] = [11, 15, 19, 21]
The above grid search returned the following parameters as producing the best accuracy score: base_estimator__max_depth = 15, n_estimators = 1000, learning_rate = 0.01. I visualized the results of the grid search, in order to observe any trends among the hyperparameter values and their accuracy scores.
From here, I tried a second grid search, this time focusing on the optimal parameters from before. The hyperparameters from the second grid search produced a higher accuracy and recall when predicting the test data. After two more rounds of hyperparameter tuning, the optimal hyperparameters were max_depth = 16, learning_rate = 0.02, and n_estimators = 700.
After completing an iterative tuning process to achieve the optimal hyperparameters with respect to the accuracy score, I began to tune the decision threshold probability with both accuracy and recall in mind. Initially, I visualized accuracy, recall and precision as functions of decision threshold probability.
The graph indicated that a threshold closer to 0 would yield an optimal balance of recall and accuracy, so I then examined the recall and accuracies of decision thresholds between 0.0 and 0.3. At the conclusion of this process, I found that 0.1 was the optimal decision threshold. With the optimal hyperparameters and threshold, the model produced an accuracy of 95.4% and recall of 80.6% on the test data.
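The threshold sweep described above can be sketched as follows. The data are synthetic and imbalanced to mimic our class distribution, and the tuned AdaBoost hyperparameters are omitted for brevity, so the printed metrics are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the CPS training set.
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = AdaBoostClassifier(random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Sweep candidate decision thresholds and report both metrics.
for t in np.arange(0.05, 0.35, 0.05):
    pred = (proba >= t).astype(int)
    print(f"t={t:.2f}  acc={accuracy_score(y_te, pred):.3f}  "
          f"recall={recall_score(y_te, pred):.3f}")
```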
The next step in my model development was to address any potential outliers in the data. AdaBoost is more sensitive to outliers than other boosting methods, because misclassified observations (including outliers) receive exponentially increasing weights under its exponential loss, rather than being identified through the gradient of a squared error loss as in gradient boosting. Thus, I performed outlier detection using the IsolationForest algorithm. In doing this, I tuned the contamination hyperparameter, which establishes the proportion of the data considered outliers. I found the optimal contamination value to be 0.0001, which I then used to remove outliers from the train data. With outlier detection, the final model produced an accuracy of 95.6% and a recall of 81.7% on the test data.
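The outlier-removal step can be sketched as below; the training matrix is a synthetic stand-in, while contamination=0.0001 is the tuned value reported above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the training matrix.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(10000, 5))

# Flag roughly contamination * n_samples rows as outliers and drop them.
iso = IsolationForest(contamination=0.0001, random_state=0)
inlier_mask = iso.fit_predict(X_train) == 1  # 1 = inlier, -1 = outlier
X_clean = X_train[inlier_mask]
print(f"{(~inlier_mask).sum()} rows flagged as outliers and removed")
```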
9.1 XGBoost
By Gabby Bliss
Initially, I compared two baseline XGBClassifier() models: one fit with the 22 Random Forest selected variables and the response, and the other fit with all the predictors and the response. The results are pictured below. The accuracies and recalls of these models are comparable, so my XGBoost tuning moved forward using the 22-predictor training dataset for computational efficiency. Computational efficiency was something our group had to keep in mind throughout model tuning, as we had over 50,000 observations in our training dataset; we discuss this further in our limitations section.
For my first pass of hyperparameter tuning (a coarse search), I utilized RandomizedSearchCV(), again with computational efficiency in mind. The randomized search was performed with 3 folds for cross validation, and was refitted on our first metric of interest, accuracy. The hyperparameter pool used for this search is shown below, along with the accompanying graphs of each hyperparameter plotted against its 3-fold accuracy from the search. The scale_pos_weight hyperparameter was held constant at the recommended value, the ratio of the number of negative classes to positive classes; in our case, this was about 4.79.
param_grid = {'n_estimators': [100, 200, 1000, 4000],
'learning_rate': [0.01, 0.1, 0.5, 1.0],
'max_depth': range(2, 23, 2),
'subsample': [0.3, 0.5, 0.6],
'gamma': range(1, 10, 2),
'reg_lambda': [1, 5, 10],
'colsample_bytree': [0.25, 0.5, 0.75],
'scale_pos_weight': [scale_pos_weight]}

The plots were used to conduct a more focused search, still refitting on accuracy but with a stratified 5-fold cross validation algorithm instead of 3-fold. Using the ranges highlighted on the plots, which were determined based on chosen values and instances of lower, less varied accuracy results, the following set of hyperparameters was used.
param_grid = {'n_estimators': [1000, 1500, 2000],
'learning_rate': [0.01, 0.1],
'max_depth': range(15, 18),
'subsample': [0.5, 0.55],
'gamma': [6, 7, 8],
'reg_lambda': [1, 1.5, 2],
'colsample_bytree': [0.5, 0.75],
'scale_pos_weight': [scale_pos_weight]}

The hyperparameters returned were {'subsample': 0.55, 'scale_pos_weight': 4.790933009512244, 'reg_lambda': 2, 'n_estimators': 1500, 'max_depth': 17, 'learning_rate': 0.1, 'gamma': 6, 'colsample_bytree': 0.75}, and they produced an XGBClassifier() model with a Test Accuracy of 92.32% and Test Recall of 82.51%. I then did a quick search using higher values of max_depth, since the chosen value was at the edge of its range, but this model performed worse on test data than the previous one, most likely due to overfitting. When analyzing the plots produced by this search, shown below, it was clear that some parameters optimizing recall, such as n_estimators, gamma, and learning_rate, differed from the parameters chosen by refitting on accuracy.
Thus, on my next search, I decided to create a model refitting our second metric of interest, recall, in order to balance models focused on both metrics in an ultimate ensembling model. The search consisted of the following hyperparameter grid, adjusted to match the ranges visually determined from the previous figure. The search featured a stratified 5-fold cross validation procedure.
param_grid = {'n_estimators': [1500, 2000, 3000, 3500],
'learning_rate': [0.001, 0.01, 0.1],
'max_depth': range(16, 19),
'subsample': [0.53, 0.55, 0.56],
'gamma': [6, 7, 8],
'reg_lambda': [1.5, 2, 3],
'colsample_bytree': [0.5, 0.75, 0.9],
'scale_pos_weight': [scale_pos_weight]}

The chosen values were {'subsample': 0.55, 'scale_pos_weight': 4.790933009512244, 'reg_lambda': 3, 'n_estimators': 2000, 'max_depth': 18, 'learning_rate': 0.01, 'gamma': 8, 'colsample_bytree': 0.9}, and the fitted model produced a Test Accuracy of 90.47% and a Test Recall of 84.72%, illustrating that refitting on recall increased this metric without a major decline in accuracy.
Finally, the best models (one refit on accuracy and one refit on recall) were optimized for decision threshold probability. This is a slightly redundant procedure given the use of the scale_pos_weight argument in the XGBoost models, but it was performed to squeeze a bit more performance out of them. The graphs used to choose a precise range of thresholds confirmed this by suggesting the optimal threshold was between 0.3 and 0.5 for both models.
The decided decision threshold probabilities were 0.431 for the model refit on accuracy and 0.452 for the model refit on recall. Both models, when tuning decision threshold probability, improved the metric that was not refit when tuning the model in searches, at the slight cost of the other metric. The results (Test Accuracy and Test Recall) are displayed below in the final confusion matrices. From qualitative reasoning, the model refit on accuracy seems to fit our objectives better.
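The class-weighting described earlier can be sketched as a one-line ratio. Note that the counts below are the full-data class counts from our appendix; the report's value of ~4.791 was computed on the training split, so the two differ slightly.

```python
# Recommended scale_pos_weight: ratio of negative (food secure) to
# positive (food insecure) labels. Counts are from the appendix table.
neg, pos = 59124, 12409
scale_pos_weight = neg / pos
print(round(scale_pos_weight, 3))  # ~4.765 on the full data
```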
A baseline CatBoost model was also fit to our data; it can be referenced in the corresponding code document. Not much tuning was performed, but as a next step, with more resources, we would have liked to at least tune the decision threshold and/or the scale_pos_weight equivalent to increase the recall score, which was very low in the baseline model.
9.2 Random forest
By Riley Otsuki
For the random forest model, I used the OOB (out-of-bag) score to tune my models, for the sake of time, rather than RandomizedSearchCV or GridSearchCV. My base model (using the previously selected variables) achieved a test accuracy of 91.93% and a recall of 61.10%, as shown below.
Due to the high volume of the dataset, I decided to tune 'max_features' and 'n_estimators', which I found significantly improved model performance. Furthermore, The Elements of Statistical Learning notes that tuning tree depth rarely produces increases in model performance, so I focused on these two hyperparameters.
I first visualized the oob_score in relation to the number of trees (it is important to note that accuracy generally increases with the number of trees). Since I wanted to roughly tune the number of trees to increase model performance, I gradually increased the range of trees, ultimately examining values up to 2000 trees, as seen below:
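The OOB-based sweep can be sketched as follows; the data are synthetic and the tree counts are shortened so the example runs quickly, but the idea of scoring each candidate with the out-of-bag estimate instead of a cross-validated search is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the report swept n_estimators up to ~2000.
X, y = make_classification(n_samples=400, random_state=0)
for n in [50, 100, 200]:
    # oob_score=True scores each forest on its out-of-bag samples,
    # avoiding a separate cross-validation loop.
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0, n_jobs=-1).fit(X, y)
    print(n, round(rf.oob_score_, 3))
```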
I then took a similar approach to tuning the ‘max_features’ hyperparameter using out of bag accuracy to visualize the optimal value. Once again, I gradually increased the number of max_features to consider until I reached a point where the out of bag accuracy started to decrease as can be seen:
I then further tuned 'n_estimators' and 'max_features' using the ranges in which classification accuracy was highest according to the visualizations, which produced the final model ('n_estimators'=1200, 'max_features'=21). Lastly, I decided to tune the decision threshold as well, and visualized the ROC curve below:
Further tuning of the threshold was done, producing a table of the training accuracy and recall at each candidate value. Ultimately, I decided to use 0.23 as the threshold, as it struck a balance between accuracy and recall, the two metrics we set out to optimize. My final model, with n_estimators of 1200, max_features of 21, and a threshold of 0.23, obtained an improved test accuracy of 91.28% and a recall of 91.18%.
9.3 Gradient Boosting
By Kenneth Yeon
For the Gradient Boosting model, I initially began with a baseline model with all the default parameters, which provided us with a slightly-better-than-naive accuracy of 84.59%. However, this initial model resulted in an extremely subpar recall of 35.89% (see figure 1). With the focus of increasing both accuracy and recall, we proceeded to conduct a coarse grid search of the following hyperparameters: n_estimators, learning_rate, max_depth, and subsample.
grid['n_estimators'] = [10,50,100,200,500]
grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
grid['max_depth'] = [1,2,3,4,5]
grid['subsample'] = [0.5, 1.0]

Using these hyperparameter values, RandomizedSearchCV() produced the following parameters: n_estimators: 500, learning_rate: 0.01, max_depth: 5, and subsample: 1.0. After determining these values, we tuned the decision threshold probability to balance recall and accuracy, obtaining an optimal threshold of 0.191. With these hyperparameter values and threshold, our second gradient boosting model showed a slight decrease in accuracy (80.59%) but a significant improvement in recall (80.95%) (see figure 2). To improve the model's accuracy, we conducted a fine grid search using the hyperparameter values below.
grid['n_estimators'] = [500,550]
grid['learning_rate'] = [0.9,1.0,1.1]
grid['max_depth'] = [3,4]
grid['subsample'] = [1.0]

GridSearchCV() reported that the optimal parameters for our model were n_estimators: 550, learning_rate: 0.09, max_depth: 4, and subsample: 1.0. We then tuned the decision threshold probability again, obtaining a threshold of 0.09. These values produced our final model, with an increased accuracy of 87.41% and an increased recall of 85.43% (see figure 3).
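The grid-search step can be sketched as below; the grid is shortened and the data synthetic so the example runs quickly, while the report's actual fine grid appears above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a reduced grid for speed.
X, y = make_classification(n_samples=300, random_state=0)
grid = {'n_estimators': [50, 100], 'max_depth': [3, 4]}

# Exhaustive search over the grid, scored on cross-validated accuracy.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid=grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
```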
10 Model Ensemble
10.1 Voting ensemble
After developing each individual model, we ensembled our efforts with a voting ensemble (VotingClassifier(), since our task is classification). With this method, the ensembled model produced an accuracy of 94.25% and a recall of 80.01% on the test data.
10.2 Stacking ensemble
We then computed a stacking ensemble of our models, with a LogisticRegression() metamodel. The stacking ensemble produced an accuracy of 95.07% and a recall of 79.54% on the test data. When comparing the two ensembling methods, voting resulted in a slightly lower accuracy, but higher recall than stacking on the test data.
10.3 Ensemble of ensembled models
Ensembling the voting and stacking models resulted in an accuracy of 94.68% and a recall of 76.57% on the test data.
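A minimal sketch of the stacking setup is shown below. The base estimators are untuned stand-ins for our tuned models, on synthetic data, combined with the LogisticRegression() metamodel described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and untuned base models for illustration.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base models feed cross-validated predictions to the logistic metamodel.
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
                ('gb', GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
acc = accuracy_score(y_te, stack.predict(X_te))
print(f"test accuracy: {acc:.3f}")
```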
11 Limitations of the model with regard to prediction
We know that with more time and resources, better hyperparameter values could have been discovered and implemented. For example, in the XGBoost models, a larger search or a comprehensive grid search over our dataset (50,000+ observations/households) could not complete without the runtime disconnecting unless Colab Pro was purchased. We were therefore limited to smaller randomized searches, in which some chosen parameters might not have been optimal in relation to one another. Additionally, some chosen values were at the edge of the ranges tested, but lower or higher values compromised computational efficiency and had to be settled upon. XGBoost could thus be better tuned with more computational resources: more combinations would be tested comprehensively with GridSearchCV() rather than RandomizedSearchCV(), along with a slightly higher number of n_estimators and a lower learning_rate, to see if that combination increased accuracy and/or recall. Additionally, for hyperparameters that take non-discrete values, an infinite number of candidates could theoretically be tested, meaning these parameters are most likely not at their “true” optimal values and could be tuned further.
Similarly for AdaBoost model development, additional rounds of tuning could have been performed if there was no time constraint. The process of outlier detection could also be further refined: other parameters of IsolationForest() such as n_estimators and max_features can be tuned to improve the performance of Adaboost, on the test data.
For our random forest model, our variable selection methods were done using random forest, which is most likely why the performance of the random forest was higher than the other models. While this is not a limitation specific to the random forest model, it is an important note to keep in mind as such predictors may not be optimal for the other models developed here.
Because our models are based on survey data, the inputs are inherently collectable by stakeholders, who already perform this data collection. It is financially feasible to conduct the surveys, as the government does now, but it might be more economical to collect only the predictors that ended up in our models. This way, the government could conduct shorter census surveys more often, and a long one every few years, balancing a streamlined, cheaper process against data updates frequent enough that our model(s) can be used each year rather than more infrequently. Our model can then predict food insecurity from these surveys as soon as they are completed, so that needs are addressed quickly and aid is allocated by the government and/or nonprofits.
Our final ensembled model yielded a considerably high accuracy and recall on the test data, indicating that the approach can effectively forecast instances of food insecurity in the United States. However, it is important to note that our models are based on an annually conducted survey. Food security is a temporary state that can fluctuate with changes in income, politics, season, and general environment. Thus, our model is limited in that it can only make annual conclusions, though food insecurity may occur in situational, smaller periods of time.
12 Conclusions and Recommendations to stakeholder(s)
In conclusion, across all four tuned and fitted models, the predictors with the highest shared feature importances were Family Income and the Number of People/Dependents in the household. These predictors are thus most pertinent for our stakeholders when conducting surveys and building models to predict food insecure households. When administering census surveys that may feed these models, these high-importance variables are especially important to gather, as they were most helpful in predicting food insecurity. Our adapted table of importances for each model is shown below.
Based upon our ~95% accuracy in classifying households by ensembling boosting models, other conclusions to present to stakeholders are as follows.
First, by performing variable selection and accurately predicting a variable synthesized from a second survey using data from the first survey, our model streamlines the process of gathering data pertinent to food insecurity. Theoretically, if a household survey's only purpose is to classify food insecurity, then rather than collecting 300+ answers (which can deter people from completing the survey), it only needs to ask the 22 questions related to the variables used in our model(s). This would save time and money for our governmental stakeholders, who could use and publish the data faster for our other stakeholders, such as nonprofits. A faster, more streamlined survey process also opens up monetary and temporal flexibility for the government, which can focus on the actual allocation of aid rather than gathering 280 additional survey answers per household. This directly increases the efficiency and precision of monetary allocation, translating into faster aid for households in need.
Lastly, our model should generalize accurately to data from future censuses. Since our model is an ensemble that analyzes trends in survey responses to predict food insecurity, it should generalize easily as long as the survey data in coming years contains the 22 predictors used in our models.
As noted in our limitations, our models are based on an annually conducted survey, while food security is a temporary state that can fluctuate within a year. Our model can be updated each year if new variables are introduced into the population survey; the variable selection and modeling processes can be repeated for these changes as necessary. Otherwise, we have attempted to construct the ensemble model to be as generalizable as possible, so that it remains usable as-is in future years and does not become obsolete after a short period of time.
Individual Contributions
| Team member | Individual Model | Work other than individual model | Details of work other than individual model |
|---|---|---|---|
| Gabby Bliss | XGBoost & Basic Catboost | Data Cleaning and Variable selection | Removed impertinent variables, dropped pertinent observations, and coded the variable selection process used in our project |
| Lila Weiner | AdaBoost | Data Visualization & Ensembling | Visualized the predictors obtained from variable selection and coded ensembling models |
| Kenneth Yeon | Gradient Boost | Slide Creation & Report Formatting | Organized slides for presentation and formatted the report |
| Riley Otsuki | Random Forest | Slide Creation & Presentation Format | Formatted organization of slides and presentation of information |
References
[0] Babu, Suresh, Shailendra Gajanan, and Prabuddha Sanyal. Food Security, Poverty and Nutrition Policy Analysis: Statistical Methods and Applications. Chapter 1, Academic Press, 2014.
[1] Christopher B. Barrett, Measuring Food Insecurity. Science 327, 825-828 (2010). DOI: 10.1126/science.1182768
[2] Current Population Survey, December 2021: Food Security Supplement. Bureau of the Census for the Bureau of Labor Statistics. Washington: U.S. Census Bureau, 2021.
[3] “U.S. Hunger Relief Organization.” Feeding America, www.feedingamerica.org. Accessed 7 June 2023.
[4] “Strategic Plan.” Food Research & Action Center, 19 Oct. 2022, frac.org/strategic-plan.
Appendix
| y | Count |
|---|---|
| 0 (food secure) | 59124 |
| 1 (food insecure) | 12409 |
Proportion of Food Insecure Households: 0.1735
Proportion of Food Secure Households: 0.8265
(Figure: bar plot of the class distribution of y.)
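The class counts and proportions above can be reproduced with pandas, assuming the target labels are held in a Series `y` (reconstructed below from the reported counts rather than from the raw survey file):

```python
import pandas as pd

# Reconstruct the label column from the reported class counts:
# 59124 food-secure (0) and 12409 food-insecure (1) households.
y = pd.Series([0] * 59124 + [1] * 12409, name="y")

counts = y.value_counts()                # absolute count per class
props = y.value_counts(normalize=True)   # proportion per class

print(counts[0], counts[1])     # 59124 12409
print(round(props[1], 4))       # 0.1735 food insecure
```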
Showing the distribution of variables in tabular form:
| | HRHHID | HRMONTH | HRYEAR4 | HURESPLI | HUFINAL | HETENURE | HEHOUSUT | HETELHHD | HETELAVL | HEPHONEO | ... | PEPDEMP1 | PTNMEMP1 | PEPDEMP2 | PTNMEMP2 | PECERT1 | PECERT2 | PECERT3 | PXCERT1 | PXCERT2 | PXCERT3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112251 | 507340501120403 | 12 | 2021 | 1 | 201 | 3 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 0 | 0 | 0 |
| 89743 | 578660581951003 | 12 | 2021 | 1 | 201 | 3 | 5 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 0 | 0 | 0 |
| 114059 | 967582201340105 | 12 | 2021 | 1 | 201 | 2 | 5 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 0 | 0 | 0 |
| 33973 | 410100031131678 | 12 | 2021 | 1 | 201 | 2 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 20 | 0 | 0 |
| 20904 | 219010610013449 | 12 | 2021 | 4 | 201 | 1 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 20 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18379 | 310946170108105 | 12 | 2021 | 1 | 201 | 3 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 94077 | 411809460586216 | 12 | 2021 | 1 | 201 | 1 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 0 | 0 | 0 |
| 8856 | 40710040307671 | 12 | 2021 | 1 | 1 | 2 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 98134 | 28720188901004 | 12 | 2021 | 1 | 201 | 1 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 1 | 2 | 2 | 0 | 0 | 0 |
| 9472 | 12076304105461 | 12 | 2021 | 2 | 1 | 1 | 1 | 1 | -1 | 1 | ... | -1 | -1 | -1 | -1 | 2 | -1 | -1 | 20 | 0 | 0 |
57226 rows × 390 columns
| | HETENURE | HEFAMINC | HWHHWGT | HRNUMHOU | HRHTYPE | HRMIS | HRHHID2 | GEDIV | GESTFIPS | GTCBSA | ... | GTCSA | PRTAGE | PEEDUCA | PWFMWGT | PWLGWGT | PWSSWGT | PWVETWGT | QSTNUM | PWCMPWGT | HXFAMINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112251 | 3 | 13 | 34491316 | 3 | 4 | 5 | 12111 | 6 | 47 | 0 | ... | 0 | 26 | 39 | 29445158 | 0 | 29445158 | 30622270 | 56171 | 30649540 | 42 |
| 89743 | 3 | 3 | 12853838 | 2 | 4 | 5 | 12012 | 8 | 35 | 22140 | ... | 0 | 29 | 35 | 12853838 | 0 | 12853838 | 11304924 | 44370 | 11504978 | 0 |
| 114059 | 2 | 10 | 37756466 | 5 | 1 | 6 | 12011 | 3 | 17 | 44100 | ... | 0 | 28 | 40 | 37756466 | 0 | 37756466 | 37957235 | 57193 | 37991036 | 23 |
| 33973 | 2 | 16 | 40579130 | 2 | 1 | 7 | 12011 | 3 | 18 | 26900 | ... | 0 | 31 | 43 | 40579130 | 60302779 | 40579130 | 39956488 | 17188 | 40140673 | 0 |
| 20904 | 1 | 16 | 45467838 | 3 | 1 | 8 | 12011 | 5 | 24 | 12580 | ... | 548 | 53 | 44 | 45467838 | 67567663 | 45467838 | 45859570 | 10727 | 45859610 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18379 | 3 | 13 | 21372705 | 4 | 1 | 2 | 13011 | 6 | 1 | 13820 | ... | 0 | 10 | -1 | 23960728 | 0 | 23960728 | 0 | 9349 | 0 | 0 |
| 94077 | 1 | 11 | 33387906 | 2 | 6 | 5 | 12012 | 9 | 6 | 31080 | ... | 348 | 37 | 43 | 33387906 | 0 | 33387906 | 32546139 | 46475 | 33223156 | 0 |
| 8856 | 2 | 1 | 40151439 | 2 | 4 | 2 | 13011 | 6 | 47 | 34980 | ... | 0 | 10 | -1 | 49328281 | 0 | 49328281 | 0 | 4651 | 0 | 0 |
| 98134 | 1 | 15 | 28083647 | 2 | 1 | 5 | 12011 | 7 | 40 | 36420 | ... | 0 | 48 | 42 | 28083647 | 0 | 22532325 | 22323298 | 48327 | 22715863 | 0 |
| 9472 | 1 | 14 | 10070023 | 2 | 1 | 3 | 13111 | 8 | 16 | 14260 | ... | 0 | 55 | 40 | 10070023 | 15682620 | 10520687 | 13552245 | 4922 | 10858623 | 0 |
57226 rows × 22 columns
HETENURE
HEFAMINC
HWHHWGT
HRNUMHOU
HRHTYPE
HRMIS
HRHHID2
GEDIV
GESTFIPS
GTCBSA
GTCO
GTCBSASZ
GTCSA
PRTAGE
PEEDUCA
PWFMWGT
PWLGWGT
PWSSWGT
PWVETWGT
QSTNUM
PWCMPWGT
HXFAMINC
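Given the full 390-column survey DataFrame, the 22 predictors listed above can be kept with a simple column selection. The toy DataFrame below is a hypothetical stand-in for the real data, used only to show the subsetting step:

```python
import pandas as pd

# The 22 predictors retained after variable selection (from the report).
selected = [
    "HETENURE", "HEFAMINC", "HWHHWGT", "HRNUMHOU", "HRHTYPE", "HRMIS",
    "HRHHID2", "GEDIV", "GESTFIPS", "GTCBSA", "GTCO", "GTCBSASZ",
    "GTCSA", "PRTAGE", "PEEDUCA", "PWFMWGT", "PWLGWGT", "PWSSWGT",
    "PWVETWGT", "QSTNUM", "PWCMPWGT", "HXFAMINC",
]

# Toy stand-in for the full survey DataFrame (values are placeholders);
# "HRMONTH" and "HRYEAR4" represent the hundreds of dropped columns.
df = pd.DataFrame({col: [0, 1] for col in selected + ["HRMONTH", "HRYEAR4"]})

subset = df[selected]   # keep only the selected predictors, in order
print(subset.shape)     # (2, 22)
```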
112251 42
89743 0
114059 23
33973 0
20904 0
..
18379 0
94077 0
8856 0
98134 0
9472 0
Name: HXFAMINC, Length: 57226, dtype: int64
# Categorical - ['HETENURE - 3', 'HEFAMINC - 16', 'HRHTYPE - 10', 'GEDIV - 9', 'GESTFIPS - 56', 'GTCO - 811', 'GTCBSASZ - 8', 'PEEDUCA - 16', 'HXFAMINC']
# Continuous - ['HWHHWGT', 'HRNUMHOU', 'HRMIS', 'HRHHID2', 'GTCBSA', 'GTCSA', 'PRTAGE', 'PWFMWGT', 'PWLGWGT', 'PWSSWGT', 'PWVETWGT', 'QSTNUM', 'PWCMPWGT']
Data Quality Check, Categorical Variables
Top 3 Value Counts, Categorical
| Variable | Value | Count |
|---|---|---|
| HETENURE | 1 | 40965 |
| | 2 | 15664 |
| | 3 | 597 |
| HEFAMINC | 16 | 10024 |
| | 15 | 9151 |
| | 14 | 7725 |
| HRHTYPE | 1 | 34100 |
| | 4 | 8096 |
| | 7 | 5588 |
| GEDIV | 5 | 9729 |
| | 9 | 9368 |
| | 8 | 7371 |
| GESTFIPS | 6 | 5586 |
| | 48 | 3181 |
| | 12 | 2186 |
| GTCO | 0 | 33940 |
| | 3 | 2085 |
| | 1 | 1940 |
| GTCBSASZ | 0 | 14807 |
| | 7 | 10659 |
| | 5 | 8736 |
| PEEDUCA | 39 | 12920 |
| | 43 | 10304 |
| | -1 | 9995 |
| HXFAMINC | 0 | 47336 |
| | 23 | 5791 |
| | 43 | 1994 |
Null Counts, Categorical in Dataset
| Variable | Null Count |
|---|---|
| HETENURE | 0 |
| HEFAMINC | 0 |
| HRHTYPE | 0 |
| GEDIV | 0 |
| GESTFIPS | 0 |
| GTCO | 0 |
| GTCBSASZ | 0 |
| PEEDUCA | 0 |
| HXFAMINC | 0 |
Unique Counts, Categorical in Training Set
| Variable | Unique Count |
|---|---|
| HETENURE | 3 |
| HEFAMINC | 16 |
| HRHTYPE | 10 |
| GEDIV | 9 |
| GESTFIPS | 51 |
| GTCO | 100 |
| GTCBSASZ | 7 |
| PEEDUCA | 17 |
| HXFAMINC | 5 |
Data Quality Check, Continuous Variables
| | HWHHWGT | HRNUMHOU | HRMIS | HRHHID2 | GTCBSA | GTCSA | PRTAGE | PWFMWGT | PWLGWGT | PWSSWGT | PWVETWGT | QSTNUM | PWCMPWGT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.722600e+04 | 57226.000000 | 57226.000000 | 57226.000000 | 57226.000000 | 57226.000000 | 57226.000000 | 5.722600e+04 | 5.722600e+04 | 5.722600e+04 | 5.722600e+04 | 57226.000000 | 5.722600e+04 |
| mean | 3.125241e+07 | 3.170185 | 4.522000 | 12647.130937 | 22894.188306 | 141.785902 | 41.306836 | 3.190562e+07 | 2.593636e+07 | 3.187452e+07 | 2.567212e+07 | 27284.621675 | 2.567593e+07 |
| std | 1.711774e+07 | 1.671008 | 2.297608 | 696.110824 | 16564.049187 | 189.830680 | 23.558046 | 1.827584e+07 | 2.985351e+07 | 1.840632e+07 | 2.016734e+07 | 16289.276113 | 2.011425e+07 |
| min | 2.261556e+06 | 1.000000 | 1.000000 | 12011.000000 | 0.000000 | 0.000000 | 0.000000 | 1.764603e+06 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000 | 0.000000e+00 |
| 25% | 1.538791e+07 | 2.000000 | 3.000000 | 12011.000000 | 0.000000 | 0.000000 | 21.000000 | 1.549287e+07 | 0.000000e+00 | 1.537878e+07 | 5.637844e+06 | 13676.000000 | 5.620554e+06 |
| 50% | 3.502429e+07 | 3.000000 | 5.000000 | 12112.000000 | 26420.000000 | 0.000000 | 41.000000 | 3.488938e+07 | 1.015628e+07 | 3.482852e+07 | 2.673053e+07 | 26702.000000 | 2.673476e+07 |
| 75% | 4.364033e+07 | 4.000000 | 7.000000 | 13011.000000 | 37980.000000 | 348.000000 | 61.000000 | 4.432648e+07 | 5.507388e+07 | 4.435773e+07 | 4.182940e+07 | 40553.000000 | 4.192015e+07 |
| max | 1.303230e+08 | 14.000000 | 8.000000 | 14111.000000 | 49740.000000 | 548.000000 | 85.000000 | 1.858748e+08 | 1.891331e+08 | 1.858748e+08 | 1.542089e+08 | 68410.000000 | 1.542089e+08 |
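The continuous summary table above is the output of pandas' `describe()`. A minimal sketch on hypothetical values for two of the continuous predictors:

```python
import pandas as pd

# Hypothetical values for two continuous predictors.
df = pd.DataFrame({
    "PRTAGE":   [26, 29, 28, 31, 53],  # person age
    "HRNUMHOU": [3, 2, 5, 2, 3],       # household size
})

# count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary.loc["mean", "HRNUMHOU"])  # 3.0
```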